Abstract

Community structure detection in complex social networks has become a hot
research topic, since it helps us better understand the network topology, for
example by revealing the functional modules in the network, and how the
network works. However, there is still no clear and widely accepted definition of
community structures, and most of the detected communities are model-based,
which makes the results hard to explain. In this paper, in contrast to
the traditional methodologies, we design two enhanced community structure
detection frameworks that incorporate background information related
to the functions of the nodes. This gives new insights into the community
discovery problem and improves the interpretability of the results. The
proposed frameworks involve user-computer interaction, and the results
are based not only on the topology of the network but also on the functions
of the nodes. Experiments on both synthetic and real-world
networks confirm the effectiveness of the frameworks.

Keywords: Community Structure Detection, Semi-supervised Learning, 
Nonnegative Matrix Factorization, Spectral Clustering 



1. Introduction 

Community structure detection in complex social networks ([1, 2]) is of
critical importance for understanding not only the network topology, but also 
how the network works. In many real applications, the revealed communities 
often correspond to functional modules of the network, such as pathways 
in metabolic networks, or a group of people that have a common interest. 
These functional modules can be considered building blocks of networks. 
Furthermore, network dynamics in the networks with community structures
can be very different from those without communities. 

However, it is very hard to give a general and widely accepted definition
of community structures due to the complexity of real problems. Moreover, most
of the revealed communities are model-based, which makes the results hard
to explain ([? ]) for a new dataset. In other words, the correctness and
meaning of the communities cannot be confirmed without background
information about the functions of the nodes. Hence, if the background
information can be effectively incorporated to guide the process of community
structure detection, we can expect much better results. A preliminary study
can be found in ref. [3].

In this paper, we systematically study how to combine
different types of prior knowledge with the models of community structure
detection, and give two enhanced frameworks for network analysis. Under the
proposed frameworks, one can easily incorporate into the adjacency
matrix of the network i) pairwise constraints on whether the two nodes of a
pair are in the same community or not; or ii) constraints given by partial
community labels. The experimental results show that the proposed
methodologies can significantly improve the detection performance. They
give new insights into the community discovery problem and improve the
interpretability of the results.

In summary, the contributions of this paper are threefold: i) We propose
a framework that can incorporate prior information of the must-link
(ML) and cannot-link (CL) type. Note that the framework proposed here is
more complete than that in ref. [3]. ii) We propose a framework that can
incorporate prior information of the class-label type. iii) The experimental
results confirm the effectiveness of the proposed frameworks.

The rest of the paper is organized as follows: Sect. [2] discusses how
to incorporate prior information, of both the ML and CL type and
the class-label type, to guide the community structure detection process,
and briefly introduces nonnegative matrix factorization (NMF) and spectral
clustering, which are applied for the detection; Sect. [3] gives the experimental
results; and finally Sect. [4] concludes.
2. Enhanced semi-supervised learning for community structure detection



In this section, we discuss our enhanced learning frameworks for community
structure detection. Firstly, given an undirected and unweighted simple
graph G, we define the associated symmetric adjacency matrix
A as follows:

$$A_{ij} = \begin{cases} 1, & \text{if } i \sim j \text{ or } i = j, \\ 0, & \text{if } i \nsim j \text{ and } i \neq j, \end{cases}$$

where $i \sim j$ means there is an edge between nodes i and j, and $i \nsim j$ means
there is no edge between them.

In many real applications, there is often some prior information available,
for example that nodes i and j are in the same community, that nodes t and k are
not in the same community, or that the community label of node p is 1 and the
label of node q is 4. We can incorporate this information into the community
structure detection process to make the result clearer and easier to explain,
which leads to the enhanced learning frameworks. The details of the
frameworks are given below.

2.1. ML and CL type information 

Specifically, if we know that some pairs of nodes are ML (the two
nodes are in the same community), or that some pairs of nodes are CL (the two
nodes are not in the same community), or both, we can incorporate these
pairwise constraints into the adjacency matrix A to get a new matrix $B^{(1)}$ as
follows:

$$B^{(1)}_{ij} = \begin{cases} \alpha, & \text{if } i \text{ and } j \text{ are must-link (ML)}, \\ 0, & \text{if } i \text{ and } j \text{ are cannot-link (CL)}, \\ A_{ij}, & \text{otherwise}. \end{cases} \tag{1}$$

By logical inference, one can derive further constraints:
i) if nodes i & t are ML, and nodes i & k are ML, then t & k are also
ML; ii) if nodes i & t are ML, and nodes i & k are CL, then t & k are also
CL. This leads to the following revision of $B^{(1)}$:

$$B^{(2)}_{tk} = \begin{cases} \alpha, & \text{if } i \,\&\, t \text{ are ML and } i \,\&\, k \text{ are ML}, \\ 0, & \text{if } i \,\&\, t \text{ are ML and } i \,\&\, k \text{ are CL}, \\ B^{(1)}_{tk}, & \text{otherwise}. \end{cases} \tag{2}$$

In this paper, we call this logical inference step the information-enhanced
step. After this step, the prior knowledge is used more fully, and its
effectiveness will be shown in Sect. [3].
[Figure 1 panels: p = 0, p = 5%, p = 10%, p = 20% (increasing percentage of pairs constrained).]

Figure 1: An illustrative example to show the effectiveness of information enhancement. p
is the percentage of pairs constrained. For the GN network, there are 128 × (128 − 1)/2 =
8128 node pairs available, and p = 5% means 406 pairs of nodes.

We set $\alpha$ to 2 in this work ([3]).
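The construction in Eqs. (1) and (2) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the constraints are assumed to be consistent (no pair is both ML and CL), and the two inference rules are applied until nothing new can be derived, which may go further than a single pass over the rules.

```python
from itertools import combinations


def find(parent, x):
    """Return the representative of x's must-link group (with path halving)."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def enhance_constraints(n, ml_pairs, cl_pairs):
    """Close ML/CL pairs under the two inference rules (hypothetical helper)."""
    parent = list(range(n))
    for i, j in ml_pairs:                    # rule i): ML is transitive
        parent[find(parent, i)] = find(parent, j)
    groups = {}
    for v in range(n):                       # members are collected in ascending order
        groups.setdefault(find(parent, v), []).append(v)
    ml = set()
    for members in groups.values():
        ml.update(combinations(members, 2))  # every pair inside an ML group
    cl = set()
    for i, j in cl_pairs:                    # rule ii): CL spreads over whole ML groups
        gi, gj = groups[find(parent, i)], groups[find(parent, j)]
        cl.update((min(a, b), max(a, b)) for a in gi for b in gj)
    return ml, cl


def revise_adjacency(A, ml, cl, alpha=2.0):
    """Write alpha on ML pairs and 0 on CL pairs, keeping the matrix symmetric."""
    B = [row[:] for row in A]                # copy, so A itself is left unchanged
    for value, pairs in ((alpha, ml), (0.0, cl)):
        for i, j in pairs:
            B[i][j] = B[j][i] = value
    return B
```

Passing the raw pairs to `revise_adjacency` corresponds to Eq. (1); passing the output of `enhance_constraints` corresponds to the information-enhanced matrix of Eq. (2).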

2.2. An illustrative example 

In this subsection, we use a GN network to show the effectiveness of our
approach: given an undirected and unweighted network with 128 nodes in four
equally sized groups, we want to detect the community structures in it.
The heatmap of the associated adjacency matrix A is shown in the upper
left of Fig. [1]. Suppose that we have some prior information about the
functions of the nodes, and can thus determine a percentage of the pairs of nodes
to be must-link (ML) or cannot-link (CL). This ML and CL information is
incorporated into the adjacency matrix A. As expected, the
community structures become clearer and clearer as the percentage of
pairs constrained rises. A more surprising observation is the effectiveness
of the logical inferences, which dramatically improve the data quality (the
second row in Fig. [1]).

2.3. Class label type information 

In practice, the ML and CL constraints are often obtained from known
labels. If we know a percentage of the class labels of the nodes, we can also
incorporate them into the adjacency matrix A as follows:

$$\mathbb{A}_{ij} = \begin{cases} \alpha, & \text{if } i \text{ and } j \text{ are known to have the same class label}, \\ 0, & \text{if } i \text{ and } j \text{ are known to have different class labels}, \\ A_{ij}, & \text{otherwise}. \end{cases} \tag{3}$$

Note that there is no need to do the information-enhanced step in this
case, because the constraints induced by known labels already cover every
pair of labeled nodes.

[Figure 2 panels: p = 0, p = 10%, p = 50%, p = 1 (increasing percentage of known labels).]

Figure 2: An illustrative example to show the effectiveness of incorporating the information
of known labels into the adjacency matrix. p is the percentage of known labels. For example,
for the GN network, p = 10% means approximately 13 nodes.

2.4. An illustrative example (continued)

Again, we use a GN network to show the effectiveness of incorporating the
information of known labels into the adjacency matrix. If we know the
class labels of some nodes, we can incorporate them into the adjacency
matrix A, and as can be observed in Fig. [2], the community structures again
become clearer and clearer as the percentage of known labels rises.

After incorporating the background information into the adjacency
matrix, we can apply unsupervised learning models, such as
nonnegative matrix factorization (NMF) and spectral clustering, to the
revised matrix for community structure detection.

2.5. Nonnegative Matrix Factorization (NMF)

The model of NMF is often formulated as the following nonlinear
programming problem ([4, 5]):

$$\min_{W,H} \|X - WH\|_F^2 \quad \text{s.t.} \quad W, H \geq 0,$$
or in other words, given the nonnegative adjacency matrix X of size n × n,
where n is the number of nodes in the network, we try to find two
nonnegative matrices, W of size n × k and H of size k × n, such that
$X \approx WH$. The objective matrix X for NMF can be selected as A, $B^{(1)}$,
$B^{(2)}$, or $\mathbb{A}$. The community structures of the network can be revealed from H:
node i belongs to community k if $H_{ki}$ is the largest element in the ith column of H.
The multiplicative update rules for NMF are summarized in
Algorithm [1].

Algorithm 1 Nonnegative Matrix Factorization (Least Squares Error)
Input: X, iter % In this paper, the iteration number iter is set to 100.
Output: W, H.
1: for t = 1 : iter do
2:   $W_{ik} := W_{ik} \dfrac{(XH^T)_{ik}}{(WHH^T)_{ik}}$
3:   $H_{ik} := H_{ik} \dfrac{(W^TX)_{ik}}{(W^TWH)_{ik}}$
4: end for
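The multiplicative updates of Algorithm 1 translate almost line by line into NumPy. The sketch below is one possible rendering, with several assumptions that are not taken from the paper: the random positive initialization, the `seed` argument, and the small constant `eps` that guards against division by zero. The helper `communities_from_H` is a hypothetical name for the column-wise argmax rule described above.

```python
import numpy as np


def nmf(X, k, n_iter=100, seed=0, eps=1e-12):
    """Least-squares NMF, X ~ W @ H, by multiplicative updates."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, k)) + eps   # random positive start (an assumption)
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        W *= (X @ H.T) / (W @ H @ H.T + eps)   # step 2 of Algorithm 1
        H *= (W.T @ X) / (W.T @ W @ H + eps)   # step 3 of Algorithm 1
    return W, H


def communities_from_H(H):
    """Assign node i to the row index of the largest entry in column i of H."""
    return np.argmax(H, axis=0)
```

A typical call would be `W, H = nmf(B, k)` followed by `communities_from_H(H)`, where `B` is any of the revised adjacency matrices stored as a float array.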



If a percentage of the community labels of the nodes is known, we can also use
the joint matrix factorization model ([6]). The model can be expressed as:

$$\min_{W,Q,H} \|X - WH\|_F^2 + \lambda \|L \mathbin{.*} (Y - QH)\|_F^2 \quad \text{s.t.} \quad W, Q, H \geq 0,$$

where .* means element-wise multiplication, Y is the label indicator matrix
of size k × n, and $Y_{au}$ is 1 if node u is of community a and 0 otherwise. L is
the weight matrix to handle the known labels, and is defined as follows:

$$L_{\cdot j} = \begin{cases} \mathbf{1}_k, & \text{if the label of } j \text{ is known}, \\ \mathbf{0}, & \text{if the label of } j \text{ is not known}, \end{cases}$$

where $L_{\cdot j}$ is the jth column of L, and $\mathbf{1}_k = [1, 1, \cdots, 1]^T \in \mathbb{R}^k$.
The corresponding multiplicative update rules of W, Q and H are as follows:

$$W_{ik} := W_{ik} \frac{(XH^T)_{ik}}{(WHH^T)_{ik}},$$

$$Q_{ik} := Q_{ik} \frac{([L \mathbin{.*} Y]H^T)_{ik}}{([L \mathbin{.*} (QH)]H^T)_{ik}},$$

$$H_{ik} := H_{ik} \frac{(W^TX + \lambda Q^T[L \mathbin{.*} Y])_{ik}}{(W^TWH + \lambda Q^T[L \mathbin{.*} (QH)])_{ik}}.$$

We set $\lambda$ to 1 in this paper, and H can be initialized as $s = L \mathbin{.*} Y$.
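As a rough illustration of the joint model, the following NumPy sketch builds Y and L from a partially labeled node list and then runs the three update rules. The function names, the use of `None` for unknown labels, the `seed` and `eps` arguments, and the random start for W and Q are all assumptions made for this example. In particular, a small positive perturbation is added to the initial H, since a column that starts at exactly zero would stay at zero under multiplicative updates; the paper itself only states that H can be initialized as s.

```python
import numpy as np


def label_matrices(labels, k):
    """Build Y (k x n indicator) and L (k x n mask); labels[j] is 0..k-1 or None."""
    n = len(labels)
    Y = np.zeros((k, n))
    L = np.zeros((k, n))
    for j, a in enumerate(labels):
        if a is not None:
            Y[a, j] = 1.0      # node j is known to belong to community a
            L[:, j] = 1.0      # the whole column is switched on for known labels
    return Y, L


def joint_objective(X, Y, L, W, Q, H, lam=1.0):
    """Value of ||X - WH||_F^2 + lam * ||L .* (Y - QH)||_F^2."""
    return (np.linalg.norm(X - W @ H) ** 2
            + lam * np.linalg.norm(L * (Y - Q @ H)) ** 2)


def joint_nmf(X, Y, L, lam=1.0, n_iter=100, seed=0, eps=1e-12):
    """Multiplicative updates for the joint factorization (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    k, n = Y.shape
    W = rng.random((n, k)) + eps
    Q = rng.random((k, k)) + eps
    LY = L * Y
    # s = L .* Y plus small positive noise (the noise is an assumption, see lead-in)
    H = LY + 0.1 * rng.random((k, n)) + eps
    for _ in range(n_iter):
        W *= (X @ H.T) / (W @ H @ H.T + eps)
        Q *= (LY @ H.T) / ((L * (Q @ H)) @ H.T + eps)
        H *= (W.T @ X + lam * Q.T @ LY) / (
            W.T @ W @ H + lam * Q.T @ (L * (Q @ H)) + eps)
    return W, Q, H
```

As with plain NMF, the community of node i can then be read off as the index of the largest entry in column i of H.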

2.6. Spectral Clustering 

Spectral clustering is another powerful tool for unsupervised learning.
The standard algorithm is summarized in Algorithm [2] ([7]).

Algorithm 2 Spectral Clustering
Input: $B \in \mathbb{R}^{n \times n}$
Output: Community label $Y \in \mathbb{R}^{n \times 1}$ of the n nodes
1: $L = D^{-1/2} B D^{-1/2}$, where D is the diagonal matrix with the elements $D_{ii} = \sum_j B_{ij}$.
2: Form the matrix $X = [x_1, x_2, \cdots, x_k] \in \mathbb{R}^{n \times k}$, where $x_i$, $i = 1, 2, \cdots, k$, are the top k eigenvectors of L.
3: Normalize X so that the rows of X have the same $L_2$ norm: $X_{ij} := X_{ij} / (\sum_j X_{ij}^2)^{1/2}$.
4: Cluster the rows of X into k clusters by K-means.
5: $Y_i = j$ if the ith row is assigned to cluster j.
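The five steps of Algorithm 2 can be sketched with NumPy as follows. This is a minimal illustration rather than a reference implementation: the small `kmeans` helper with a deterministic farthest-point start is an assumption introduced to keep the example self-contained, "top k" is taken to mean the k largest eigenvalues, and every row of B is assumed to have a positive sum (which holds when the diagonal is 1, as in the definition of A).

```python
import numpy as np


def kmeans(points, k, n_iter=50):
    """Plain Lloyd iterations with a deterministic farthest-point start."""
    centers = [points[0]]
    for _ in range(1, k):
        d = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(d)])   # farthest point from chosen centers
    centers = np.array(centers)
    for _ in range(n_iter):
        d = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = d.argmin(axis=1)
        for c in range(k):
            if np.any(assign == c):
                centers[c] = points[assign == c].mean(axis=0)
    return assign


def spectral_clustering(B, k):
    """Steps 1-5 of Algorithm 2 for a symmetric nonnegative matrix B."""
    deg = B.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)                    # assumes positive row sums
    L = d_inv_sqrt[:, None] * B * d_inv_sqrt[None, :]  # D^{-1/2} B D^{-1/2}
    eigvals, eigvecs = np.linalg.eigh(L)               # eigenvalues in ascending order
    X = eigvecs[:, -k:]                                # eigenvectors of the k largest
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # unit-length rows
    return kmeans(X, k)
```

The cluster indices returned here are arbitrary up to a permutation, so they should be compared with ground-truth labels through a permutation-invariant measure such as NMI.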



3. Experimental Results 

In this section, we empirically tested the effectiveness of our proposed 
enhanced semi-supervised learning frameworks for community structure de- 
tection. To do this, we applied NMF and spectral clustering with the revised 
adjacency matrices to several well-studied networks. 
3.1. Data Description

Both the synthetic and the real-world networks were used in our experi- 
ments. Details are as follows: 

1) GN (Girvan & Newman, [1]): The GN network, also known as the "four
groups" network, has 128 nodes, which are divided into four equal-sized
non-overlapping communities with 32 nodes each. On average, each node has
$Z_{in} + Z_{out} = 16$ neighbors; in other words, it randomly connects with $Z_{in}$
nodes in its own community and $Z_{out}$ nodes in other communities. As
expected, with increasing $Z_{out}$, the community structures become
less and less clear and the problem more challenging. In our experiment,
we set $Z_{out}$ to 10 and $Z_{in}$ to 6.

2) LFR (Lancichinetti, Fortunato & Radicchi, [8]): Compared with GN
networks, LFR networks are more realistic. In LFR networks, both the
degree and the community size distributions obey power laws, with
exponents $\gamma$ and $\beta$. Each node has a fraction $1 - \mu$ of its neighbors in its own
community and a fraction $\mu$ in other communities. Here $\mu$ is called the
mixing parameter.

We set the parameters of the LFR network as follows: the number of
nodes was 1000, the average degree of the nodes was 20, the maximum
degree was 50, the exponent of the degree distribution $\gamma$ was 2 and that of
the community size distribution $\beta$ was 1, and the mixing parameter $\mu$ was
0.9. The communities were non-overlapping.

3) Football team network ([1]): this dataset is the network of American
football games (not soccer) between 115 college teams during the regular
season of Fall 2000. The edges connect teams that played each other. The
teams were divided into 12 conferences, and all but a few teams played
against teams in the same conference more frequently than against those in
other conferences.
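For readers who want to reproduce a network of the GN type, the sketch below draws one in pure Python. The text above does not specify the generator, so the approach here is an assumption: each within-community pair is linked with probability $Z_{in}/31$ and each between-community pair with probability $Z_{out}/96$, which gives the stated expected degrees. The function name and the `seed` argument are hypothetical, and the diagonal is set to 1 to match the definition of A in Sect. 2.

```python
import random


def gn_benchmark(z_out=10, n_groups=4, group_size=32, degree=16, seed=0):
    """Draw a GN-style network; returns (adjacency matrix, community labels)."""
    rng = random.Random(seed)
    n = n_groups * group_size
    labels = [v // group_size for v in range(n)]
    p_in = (degree - z_out) / (group_size - 1)   # expected z_in links inside
    p_out = z_out / (n - group_size)             # expected z_out links outside
    A = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            p = p_in if labels[i] == labels[j] else p_out
            if rng.random() < p:
                A[i][j] = A[j][i] = 1            # undirected, unweighted edge
    return A, labels
```

With the default arguments this corresponds to the setting $Z_{out} = 10$, $Z_{in} = 6$ used in the experiments.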

3.2. Assessment Standard

In this paper, we used the normalized mutual information (NMI,
[9]) to assess the effectiveness of our approach. The value can be formulated
as follows:

$$\mathrm{NMI}(M_1, M_2) = \frac{\sum_{i=1}^{k}\sum_{j=1}^{k} n_{ij} \log \dfrac{n\, n_{ij}}{n_i^{(1)} n_j^{(2)}}}{\sqrt{\left(\sum_{i=1}^{k} n_i^{(1)} \log \dfrac{n_i^{(1)}}{n}\right)\left(\sum_{j=1}^{k} n_j^{(2)} \log \dfrac{n_j^{(2)}}{n}\right)}},$$

where $M_1$ is the ground-truth community label and $M_2$ is the computed
community label, k is the community number, n is the number of nodes, $n_{ij}$
is the number of nodes in the ground-truth community i that are assigned to
the computed community j, $n_i^{(1)}$ is the number of nodes in the ground-truth
community i, $n_j^{(2)}$ is the number of nodes in the computed community j, and
log is the natural logarithm.

In general, the higher the NMI, the better the result. 
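The NMI defined above can be computed directly from two label lists. The following pure-Python sketch is one way to do it; the function name is hypothetical, terms with $n_{ij} = 0$ are skipped (the usual convention that 0 log 0 = 0), and returning 0.0 when one partition has a single community is an assumption, since the formula is undefined in that case.

```python
from collections import Counter
from math import log, sqrt


def nmi(truth, pred):
    """Normalized mutual information between two labelings of the same nodes."""
    n = len(truth)
    n1 = Counter(truth)             # n_i^(1): sizes of ground-truth communities
    n2 = Counter(pred)              # n_j^(2): sizes of computed communities
    n12 = Counter(zip(truth, pred)) # n_ij: overlap counts (only nonzero ones)
    num = sum(c * log(n * c / (n1[i] * n2[j])) for (i, j), c in n12.items())
    den = sqrt(sum(c * log(c / n) for c in n1.values())
               * sum(c * log(c / n) for c in n2.values()))
    return num / den if den > 0 else 0.0
```

Because only the overlap counts enter the formula, the value does not depend on how the community indices are numbered in either labeling.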

3.3. Results Analysis 

We then compared the NMI results obtained by the models
with and without prior information available. Note that, for the class-label
type prior information, we deleted the nodes with known labels when
computing NMI to make a fair comparison; in other words, the results
clearly reflect the effect of the prior information on partitioning the
nodes with unknown labels. The results were averaged over ten trials and are
summarized in Fig. [3] and Fig. [4]. From these figures, one can observe that:

i) For the ML and CL type prior information, 1) the averaged NMI of
the semi-supervised learning, both with and without the information-enhanced
step, rises with the increasing percentage of ML and CL pairs constrained;
2) the information-enhanced step significantly increases the detection
performance. For example, given 5 percent of pairs constrained
on GN networks, the NMI of the non-enhanced semi-supervised NMF
(= 53.28%) is more than forty percentage points higher than that of the
unsupervised one (= 10.50%), and the NMI of the enhanced NMF (= 85.11%) is
further improved by more than thirty percentage points; 3) spectral clustering is
slightly better than NMF.

ii) For the class-label type prior information, 1) the averaged NMI of the
semi-supervised learning rises with the increasing percentage of known
labels; 2) the results of NMF on the revised adjacency matrices $\mathbb{A}$
are comparable to those of the joint NMF, which means that the joint
NMF may not be necessary, and the framework for incorporating the
constraints of partial community labels can also be model-independent,
just like that for ML & CL; 3) the improvement of NMI may not be as
remarkable as that given ML and CL type prior information, especially
for the LFR networks, but the workload is smaller than that with ML
and CL type constraints (for example, on the GN network, 5 percent of pairs
constrained means 0.05 × 128 × (128 − 1)/2 ≈ 406 pairs, whereas 5 percent of known
labels means 0.05 × 128 ≈ 6 nodes); 4) NMF is slightly better than spectral
clustering for the LFR networks, whose structures are more complex.

3.4. A Case Study: College Football Network

In this subsection, we analyzed the football team network as a case study
to show the effectiveness of our semi-supervised learning framework for the ML
& CL type.

In the football network, there are 115 nodes (teams), and they belong to
12 different conferences. Most teams played more frequently against teams in
their own conference. However, there are also abnormal teams that
played more frequently against teams in other conferences, including
teams 37, 43, 81, 83, 91 (in conference IA Independents), 12, 25, 51, 60, 64,
70, 98 (in conference Sunbelt), 111, 29 and 59. For more details, please refer
to ref. [3].

Following our previous work ([3]), we set the community number
to 11 and treated teams 37, 43, 81, 83, 91 as having no ground-truth conference
labels; in other words, there are 110 labeled teams and 110 × (110 − 1)/2 =
5995 team pairs available. Firstly, we randomly selected some of these pairs
as prior information: if the two teams were in the same conference, they
were must-link (ML); otherwise, they were cannot-link (CL). We then
performed the information-enhanced step. Finally, we applied NMF to
the revised adjacency matrices to obtain the partitioning results.
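The pair-selection step described above can be sketched as follows. This is an illustrative assumption about how the sampling might be done, not the authors' script: the function name, the `seed` argument, the use of `None` for unlabeled teams, and rounding the number of sampled pairs to the nearest integer are all choices made for this example.

```python
import random
from itertools import combinations


def sample_pair_constraints(labels, fraction, seed=0):
    """Draw a fraction of the labeled node pairs and split them into ML and CL."""
    labeled = [v for v, c in enumerate(labels) if c is not None]
    pairs = list(combinations(labeled, 2))       # all pairs of labeled nodes
    count = round(fraction * len(pairs))         # e.g. 5% of 5995 pairs -> 300
    chosen = random.Random(seed).sample(pairs, count)
    ml = [(i, j) for i, j in chosen if labels[i] == labels[j]]   # same conference
    cl = [(i, j) for i, j in chosen if labels[i] != labels[j]]   # different ones
    return ml, cl
```

With 110 labeled teams there are 5995 candidate pairs, so a fraction of 0.05 yields 300 constrained pairs and a fraction of 0.20 yields 1199, matching the counts reported for this case study.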

Fig. [5] gives the partitioning results of NMF corresponding to different
percentages of pairs constrained (with and without logical inferences), from
which one can observe that:

i) When no prior information was given, five abnormal teams were
mis-clustered: teams 29, 60, 64, 98, 111.

Figure 3: Averaged NMI of NMF and spectral clustering under different percentages of node
pairs constrained on GN and LFR networks. A denotes NMF (or spectral clustering) on
the standard adjacency matrix A, $B^{(1)}$ denotes NMF (or spectral clustering) on the revised
adjacency matrix $B^{(1)}$, and $B^{(2)}$ denotes NMF (or spectral clustering) on the revised matrix
$B^{(2)}$. The inset compares the NMI results obtained by NMF and spectral clustering on

Figure 4: Averaged NMI of NMF and spectral clustering under different percentages of
known labels on GN and LFR networks. A denotes NMF (or spectral clustering) on
the standard adjacency matrix A, $\mathbb{A}$ denotes NMF (or spectral clustering) on the revised
adjacency matrix $\mathbb{A}$, and A + Y + s denotes joint NMF on the standard A and the
community label matrix Y, with H initialized as s = L .* Y. The inset compares the
NMI results obtained by NMF on A + Y + s and spectral clustering on $\mathbb{A}$.

ii) When given 5 percent of pairs constrained, the number of constrained ML and CL
pairs without the information-enhanced step was 300 (5
percent of the node pairs available, before logical inferences), and rose to
1130 (18.85 percent, after). Three abnormal teams were
mis-clustered: teams 29, 64, 111. An interesting observation here is that
team 64 was mis-clustered into different conferences in the two settings:
conference 10 without logical inferences and conference 5 with them.
The reasons are as follows: in this experiment,
before the information-enhanced step, there were 7 CL pairs and
no ML pairs related to node 64, among which two were related to
conference 10 while none was related to conference 5. After
the information-enhanced step, there were 18 CL pairs and no ML pairs
related to node 64, among which seven were related to conference
10, and there were still no pairs related to conference 5. The result
was thus guided by the enhanced prior information, and team 64
was assigned to conference 5. Note that since the constrained pairs
are selected randomly, another round of the experiment may result in
different network partitions. Table [1] gives more details about the ML
and CL pairs related to team 64.

iii) When given 20 percent of prior knowledge, the number of constrained ML and CL
pairs was 1199 (20 percent, before enhancement), and
rose to 5651 (94.26 percent, after). Only one team, team 111, was
mis-clustered, into conference 12, under the non-enhanced learning.
After enhancement, the teams of conference 12 were all in the CL set
of team 111, which was very helpful to the community structure
detection process. Table [2] gives more details about the ML and CL
pairs related to team 111. Team 64 was clustered correctly.
Before the information-enhanced step, there were 30 CL pairs and 3 ML
pairs related to team 64, and after enhancement, there were 102 CL pairs
(including almost all the labeled teams that are not in conference 11)
and 6 ML pairs (including all the labeled teams in conference 11) related
to team 64.

Table 1: Must-link and cannot-link pairs of teams related to team 64, and the teams in
conferences 10 and 5. Bracketed nodes are those included in ML or CL.

Node 64       | ML   | CL
Non-enhanced  | none | 21, 44, 57, 59, 72, 85, 107
Enhanced      | none | 19, 21, 27, 39, 44, 55, 57, 59, 62, 66, 72, 85, 86, 88, 96, 97, 107, 114

Conference 10 | 18, [21], 28, [57], 63, [66], 71, 77, [88], [96], [97], [114]
Conference 5  | 45, 49, 58, 67, 76, 87, 92, 93, 111, 113


Table 2: Must-link and cannot-link pairs of teams related to team 111, and the teams in
conferences 5 and 12. Bracketed nodes are those included in ML or CL.

Node 111      | ML     | CL
Non-enhanced  | 45, 67 | 3, 7, 10, 13, 15, 18, 24, 26, 27, 46, 79, 82, 88, 101
Enhanced      | 45, 67 | a total of 95 nodes

Conference 5  | [45], 49, 58, [67], 76, 87, 92, 93, 111, 113
Conference 12 | [29], [47], [50], [54], [59], [68], [74], [84], [89], [115]



In summary, our semi-supervised learning framework did make better 
use of the prior information and could significantly improve the model per- 
formance. 

4. Conclusions and Future Works 

How to improve the clustering performance using prior information is an
important problem in the semi-supervised learning community. In this paper, we
have proposed two enhanced learning frameworks for community structure
detection in social networks. The two frameworks can add either the
supervision of must-link (ML) and cannot-link (CL) pairwise constraints or
the class-label constraints to the adjacency matrices. For the ML and CL
type constraints, we have further introduced the information-enhanced step based
on logical inferences. Note that this step is only valid in the case of
non-overlapping community structures. Otherwise, for example when node i has
multiple community labels, the prior information that nodes i and t are ML
and nodes i and k are CL does not necessarily imply that nodes t
and k are also CL. The numerical results on both synthetic and real-world
networks have confirmed the effectiveness of the proposed frameworks.

Indeed, in practice, it is possible that we have a hybrid of prior
information on class labels and on ML & CL, i.e., some prior information is on ML
and CL constraints, and some other information is on class-label constraints.
Our proposed frameworks can be trivially extended to this case.

Figure 5: Comparison of the semi-supervised learning results of NMF with and without the
information-enhanced step corresponding to different percentages of pairs constrained (color
online). (a): Real grouping in the football dataset. There are 12 conferences of 8-12 teams
(nodes) each. Teams in conference 6 are not labeled. (b): Result of NMF without any
prior information. (c): Result of NMF given 5 percent of pairs constrained (without the
information-enhanced step). (d): Result of NMF given 5 percent of pairs constrained
(with). (e): Result of NMF given 20 percent of pairs constrained (without). (f): Result of
NMF given 20 percent of pairs constrained (with); all the labeled teams are correctly
clustered.



An interesting problem related to our work is the analysis of
dynamic networks, such as detecting the communities in a series of
time-varying networks. Given the network structure at time t, we can find some
conservative relationships between nodes and use them as ML and CL
constraints to detect the communities in the new network at time t + 1. We
term this online semi-supervised learning.